Abstract

This paper considers the model introduced by Bilu and Linial [2010], who study problems for which the optimal clustering does not change when the distances are perturbed by multiplicative factors. They show that even when a problem is NP-hard, it is sometimes possible to obtain polynomial-time algorithms for instances resilient to large perturbations, e.g. on the order of $O(\sqrt{n})$ for max-cut clustering. Awasthi et al. [2010a] extend this line of work by considering center-based objectives, and Balcan and Liang [2011] consider the $k$-median and min-sum objectives, giving efficient algorithms for instances resilient to certain constant multiplicative perturbations.

Here, we are motivated by the question of to what extent these assumptions can be relaxed while allowing for efficient algorithms. We show there is little room to improve these results by giving NP-hardness lower bounds for both the $k$-median and min-sum objectives. On the other hand, we show that multiplicative resilience parameters, even only on the order of $\Theta(1)$, can be so strong as to make the clustering problem trivial, and we exploit these assumptions to present a simple one-pass streaming algorithm for the $k$-median objective. We also consider a model of additive perturbations and give a correspondence between additive and multiplicative notions of stability. Our results provide a close examination of the consequences of assuming even constant stability in data.



1 Introduction 

Clustering is one of the most widely-used techniques in statistical data analysis. The need to partition, or cluster, data into meaningful categories naturally arises in virtually every domain where data is abundant. Unfortunately, most of the natural clustering objectives, including $k$-median, $k$-means, and min-sum, are NP-hard to optimize [Guha and Khuller, 1999, Jain et al., 2002]. It is, therefore, unsurprising that many clustering algorithms used in practice come with few guarantees.

Motivated by overcoming the hardness results, Bilu and Linial [2010] consider a perturbation resilience assumption that they argue is often implicitly made when choosing a clustering objective: that the optimum clustering to the desired objective $\Phi$ is preserved under multiplicative perturbations up to a factor $\alpha > 1$ to the distances between the points. They reason that if the optimum clustering to an objective $\Phi$ is not resilient, as in, if small perturbations to the distances can cause the optimum to change, then $\Phi$ is perhaps the wrong objective to be optimizing in the first place. Bilu and Linial [2010] show that for max-cut clustering, instances resilient to perturbations of $\alpha = O(\sqrt{n})$ have efficient algorithms for recovering the optimum itself.

Continuing that line of research, Awasthi et al. [2010b] give a polynomial time algorithm that finds the optimum clustering for instances resilient to multiplicative perturbations of $\alpha = 3$ for center-based¹ clustering objectives when centers must come from the data (we call this the proper setting), and $\alpha = 2 + \sqrt{3}$ when the centers do not need to (we call this the Steiner setting). Their method relies on a stability property implied by perturbation resilience (see Section 2). For the Steiner case, they also prove an NP-hardness lower bound of $\alpha = 3$. Subsequently, Balcan and Liang [2011] consider the proper setting and improve the constant past $\alpha = 3$ by giving a new polynomial time algorithm for the $k$-median objective for $\alpha = 1 + \sqrt{2} \approx 2.4$ stable instances.

¹ For center-based clustering objectives, the clustering is defined by a choice of centers, and the objective is a function of the distances of the points to their closest center.

Data Stability in Clustering: A Closer Look


1.1 Our Results 

Our work further delves into the proper setting, for which no lower bounds have previously been shown for the stability property. In Section 3 we show that even in the proper case, where the algorithm is restricted to choosing its centers from the data points, for any $\epsilon > 0$, it is NP-hard to optimally cluster $(2 - \epsilon)$-stable instances, both for the $k$-median and min-sum objectives (Theorems 5 and 7). To prove this for the min-sum objective, we define a new notion of stability that is implied by perturbation resilience, a notion that may be of independent interest.

Then in Section 4 we look at the implications of assuming resilience or stability in the data, even for a constant perturbation parameter $\alpha$. We show that for even fairly small constants, the data begins to have very strong structural properties, so as to make the clustering task fairly trivial. When $\alpha$ approaches $\approx 5.7$, the data begins to show what is called strict separation, where each point is closer to points in its own cluster than to points in other clusters (Theorem 9). We show that with strict separation, optimally clustering in the very restrictive one-pass streaming model becomes possible (Theorem 11).

Finally, in Section 5 we look at whether the picture can be improved for clustering data that is stable under additive, rather than multiplicative, perturbations. One hope would be that additive stability is a more useful assumption, where a polynomial time algorithm for $\epsilon$-stable instances might be possible. Unfortunately, this is not the case. We consider a natural additive model and show that severe lower bounds hold for the additive notion as well (Theorems 16 and 20). On the positive side, we show via reductions that algorithms for multiplicatively stable data also work for additively stable data for a different related stability parameter.

Our results demonstrate that on the one hand, it is 
hard to improve the algorithms to work for low sta- 
bility constants, and that on the other hand, higher 
stability constants can be quite strong, to the point of 
trivializing the problem. Furthermore, switching from 
a multiplicative to an additive stability assumption 
does not help to circumvent the hardness results, and 
perhaps makes matters worse. These results, taken 
together, narrow the range of interesting stability parameters for theoretical study and highlight the strong role that the choice of constant plays in stability assumptions.

One thing to note is that there is some difference between the very related resilience and stability properties (see Section 2), stability being weaker and more general [Awasthi et al., 2010b]. Some of our results apply to both notions, and some only to stability. This still leaves open the possibility of devising polynomial-time algorithms that, for a much smaller $\alpha$, work on all the $\alpha$-perturbation resilient instances, but not on all $\alpha$-stable ones.



1.2 Previous Work 
On Clustering 

The classical approach in theoretical computer science to dealing with the worst-case NP-hardness of clustering has been to develop efficient approximation algorithms for the various clustering objectives [Arora et al., 1998, Arya et al., 2004, Bartal et al., 2001, Charikar et al., 2002, Kumar et al., 2010, de la Vega et al., 2003], and significant efforts have been exerted to improve approximation ratios and to prove lower bounds. In particular, for metric $k$-median, the best known guarantee is a $(3 + \epsilon)$-approximation [Arya et al., 2004], and the best known lower bound is $(1 + 1/e)$-hardness of approximation [Guha and Khuller, 1999, Jain et al., 2002]. In the case of metric min-sum, the best known result is an $O(\mathrm{polylog}(n))$-approximation to the optimum [Bartal et al., 2001].



In contrast, a more recent direction of research has been to characterize under what conditions we can find a desirable clustering efficiently. Perturbation resilience/stability are, of course, such notions, but they are related to other notions of stability in clustering. Ostrovsky et al. [2006] demonstrate the effectiveness of Lloyd-type algorithms [Lloyd, 1982] on instances with the stability property that the cost of the optimal $k$-means solution is small compared to the cost of the optimal $(k - 1)$-means solution, and their guarantees have recently been improved by Awasthi et al. [2010c].



In a different line of work, Balcan et al. [2008] consider what stability properties of a similarity function, with respect to the ground truth clustering, are sufficient to cluster well. In a related direction, Balcan et al. [2009] argue that, for a given objective $\Phi$, approximation algorithms are most useful when the clusterings they produce are structurally close to the optimum originally sought in choosing to optimize $\Phi$ in the first place. They then show that, for many objectives, if one makes this assumption explicit - that all $c$-approximations to the objective yield a clustering that is structurally $\epsilon$-close to the optimum - then one can recover an $\epsilon$-close clustering in polynomial time, surprisingly for values of $c$ below the hardness of approximation constant. The assumptions and algorithms of Balcan et al. [2009] have subsequently been carefully analyzed by Schalekamp et al. [2010].



Stability in Other Settings 



Just as the Bilu and Linial [2010] notion of stability gives conditions under which efficient clustering is possible, similar concepts have been studied in game theory. Lipton et al. [2006] propose a notion of stability for solution concepts of games. They define a game to be stable if small perturbations to the payoff matrix do not significantly change the value of the game, and they show games are generally not stable under this definition. Then, in a similar spirit to the work of Bilu and Linial, Awasthi et al. [2010a] propose a related stability condition for a game, which can be leveraged in finding its approximate Nash equilibria.

The Bilu and Linial [2010] notion of stability has also been studied in the context of metric TSP, for which Mihalák et al. [2011] give efficient algorithms for 1.8-perturbation resilient instances, illustrating another case where a stability assumption can circumvent NP-hardness.



From a different direction, Ben-David et al. [2006] consider the stability of clustering algorithms, as opposed to instances. They say an algorithm is stable if it produces similar clusterings for different inputs drawn from the same distribution. They argue that stability is not as useful a notion as had been previously thought in determining various parameters, such as the optimal number of clusters.

2 Notation and Preliminaries 

In a clustering instance, we are given a set $S$ of $n$ points in a finite metric space, and we denote $d : S \times S \to \mathbb{R}_{\geq 0}$ as the distance function. $\Phi$ denotes the objective function over a partition of $S$ into $k$ clusters which we want to optimize over the metric, i.e. $\Phi$ assigns a score to every clustering. The optimal clustering w.r.t. $\Phi$ is denoted as $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$.

For the $k$-median objective, we partition $S$ into $k$ disjoint subsets $\{S_1, S_2, \ldots, S_k\}$ and assign a center $s_i \in S$ for each subset $S_i$. The goal is to minimize $\Phi$, which is measured by $\sum_{i=1}^{k} \sum_{p \in S_i} d(p, s_i)$. The centers in the optimal clustering are denoted as $c_1, \ldots, c_k$. Clearly, in an optimal solution, each point is assigned to its nearest center. For the min-sum objective, $S$ is partitioned into $k$ disjoint subsets $\{S_1, S_2, \ldots, S_k\}$, and the objective is to minimize $\Phi$, which is measured by $\sum_{i=1}^{k} \sum_{p, q \in S_i} d(p, q)$.

Now, we define the perturbation resilience notion introduced by Bilu and Linial [2010].

Definition 1. For $\alpha > 1$, a clustering instance $(S, d)$ is $\alpha$-perturbation resilient to a given objective $\Phi$ if for any function $d' : S \times S \to \mathbb{R}_{\geq 0}$ such that $\forall p, q \in S$, $d(p, q) \leq d'(p, q) \leq \alpha d(p, q)$, there is a unique optimal clustering $\mathcal{C}'$ for $\Phi$ under $d'$ and this clustering is equal to the optimal clustering $\mathcal{C}$ for $\Phi$ under $d$.
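To make the two objectives and Definition 1 concrete, the following Python sketch computes the $k$-median and min-sum costs on a small instance and checks that one particular $\alpha$-perturbation leaves the optimal $k$-median partition unchanged. It is only an illustration under stated assumptions: the function names and the four-point instance are hypothetical, the min-sum cost is assumed to sum over ordered pairs (which matches the cost of $n$ used in Section 3.2), and a single perturbation cannot certify resilience, since Definition 1 quantifies over all perturbations.

```python
from itertools import combinations

def kmedian_cost(d, centers):
    """k-median cost: each point pays the distance to its nearest center."""
    return sum(min(d[p][c] for c in centers) for p in range(len(d)))

def minsum_cost(d, clusters):
    """Min-sum cost, assumed here to sum d(p, q) over ordered pairs in each cluster."""
    return sum(d[p][q] for cl in clusters for p in cl for q in cl)

def optimal_kmedian_centers(d, k):
    """Brute force over all k-subsets of the data (proper setting); exponential in k."""
    return min(combinations(range(len(d)), k), key=lambda cs: kmedian_cost(d, cs))

def induced_partition(d, centers):
    """Partition obtained by assigning every point to its nearest center."""
    clusters = {c: set() for c in centers}
    for p in range(len(d)):
        clusters[min(centers, key=lambda c: d[p][c])].add(p)
    return {frozenset(s) for s in clusters.values()}

# Hypothetical instance: four points on a line at positions 0, 1, 10, 11.
pos = [0, 1, 10, 11]
d = [[abs(a - b) for b in pos] for a in pos]
base = induced_partition(d, optimal_kmedian_centers(d, 2))

# One alpha-perturbation (alpha = 2): stretch only the intra-cluster distances.
alpha = 2
d_pert = [row[:] for row in d]
for p, q in [(0, 1), (2, 3)]:
    d_pert[p][q] = d_pert[q][p] = alpha * d[p][q]
perturbed = induced_partition(d_pert, optimal_kmedian_centers(d_pert, 2))
```

Stretching only intra-cluster distances is the natural adversarial choice here, since it makes the current clustering as expensive as the perturbation budget allows while leaving alternatives cheap.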

In this paper, we consider the $k$-median and min-sum objectives, and we thereby investigate the following definitions of stability, which are implied by perturbation resilience, as shown in Sections 3.1 and 3.2. The following definition is adapted from Awasthi et al. [2010b].

Definition 2. A clustering instance $(S, d)$ is $\alpha$-center stable for the $k$-median objective if for any optimal cluster $C_i \in \mathcal{C}$ with center $c_i$ and any other optimal cluster $C_j \in \mathcal{C}$ ($j \neq i$) with center $c_j$, any point $p \in C_i$ satisfies $\alpha d(p, c_i) < d(p, c_j)$.
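As a minimal sketch of Definition 2, the check below tests the center stability inequality for a given optimal clustering and its centers. The function name and the six-point instance are hypothetical, and the code assumes the optimal clusters and centers are supplied by the caller rather than computed.

```python
def is_center_stable(d, clusters, centers, alpha):
    """True if alpha * d(p, c_i) < d(p, c_j) for every p in C_i and every j != i."""
    for i, cluster in enumerate(clusters):
        for p in cluster:
            for j, c_j in enumerate(centers):
                if j != i and not alpha * d[p][centers[i]] < d[p][c_j]:
                    return False
    return True

# Hypothetical instance: points on a line at 0, 1, 2 and 10, 11, 12.
pos = [0, 1, 2, 10, 11, 12]
d = [[abs(a - b) for b in pos] for a in pos]
clusters = [{0, 1, 2}, {3, 4, 5}]   # optimal 2-median clusters (point indices)
centers = [1, 4]                    # their centers: the middle point of each group
```

On this instance the binding points are those at positions 2 and 10, which are at distance 1 from their own center and 9 from the other, so the instance is $\alpha$-center stable exactly for $\alpha < 9$.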

Next, we define a new analogous notion of stability for the min-sum objective, and we show in Section 3.2 that for the min-sum objective, perturbation resilience implies min-sum stability. To help with exposition for the min-sum objective, we define the distance from a point $p$ to a set of points $A$,

$$d(p, A) = \sum_{q \in A} d(p, q).$$

Definition 3. A clustering instance $(S, d)$ is $\alpha$-min-sum stable for the min-sum objective if for all optimal clusters $C_i, C_j \in \mathcal{C}$ ($j \neq i$), any point $p \in C_i$ satisfies $\alpha d(p, C_i) < d(p, C_j)$.

The above is an especially useful generalization be- 
cause algorithms working under the perturbation re- 
silience assumption often also work for min-sum sta- 
bility. 
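The min-sum stability condition can be checked in the same way as center stability, with the point-to-set distance $d(p, A)$ in place of the distance to a center. The sketch below is illustrative only; the function names and the instance are hypothetical, and the optimal clusters are assumed to be given.

```python
def dist_to_set(d, p, A):
    """d(p, A): sum of the distances from p to every point of A."""
    return sum(d[p][q] for q in A)

def is_minsum_stable(d, clusters, alpha):
    """True if alpha * d(p, C_i) < d(p, C_j) for every p in C_i and every j != i."""
    for i, c_i in enumerate(clusters):
        for j, c_j in enumerate(clusters):
            if i == j:
                continue
            for p in c_i:
                if not alpha * dist_to_set(d, p, c_i) < dist_to_set(d, p, c_j):
                    return False
    return True

# Hypothetical instance: points on a line at 0, 1, 2 and 10, 11, 12.
pos = [0, 1, 2, 10, 11, 12]
d = [[abs(a - b) for b in pos] for a in pos]
clusters = [{0, 1, 2}, {3, 4, 5}]
```

Here the point at position 2 has $d(p, C_1) = 3$ and $d(p, C_2) = 27$, so the instance is $\alpha$-min-sum stable exactly for $\alpha < 9$.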

3 Lower Bounds 

3.1 The k-Median Objective

Awasthi et al. [2010b] prove the following connection between perturbation resilience and stability. Both their algorithms and the algorithms of Balcan and Liang [2011] crucially use this stability assumption.

Lemma 4. Any clustering instance that is $\alpha$-perturbation resilient for the $k$-median objective also satisfies $\alpha$-center stability.

Awasthi et al. [2010b] proved that for $\alpha < 3 - \epsilon$, $k$-median clustering $\alpha$-center stable instances is NP-hard when Steiner points are allowed in the data. Afterwards, Balcan and Liang [2011] circumvented this lower bound and achieved a polynomial time algorithm for $\alpha = 1 + \sqrt{2}$ by assuming the algorithm must choose cluster centers from within the data.

First, we prove a lower bound for the center stable property in this more restricted setting, stated in the theorem below. This shows there is little hope of progress, even for data that is more stable than what one could hope for real data sets, i.e. for data where each point is nearly twice as close to its own center as to any other center.

Theorem 5. For any $\epsilon > 0$, the problem of solving $(2 - \epsilon)$-center stable $k$-median instances is NP-hard.

Proof. We reduce from the perfect dominating set promise problem (PDS-PP), which we prove to be NP-hard (see Appendix A), where we are promised that the input graph $G = (V, E)$ is such that all of its smallest dominating sets $D$ are perfect, and we are asked to find a dominating set of size at most $d$. The reduction is simple. We take an instance of the NP-hard problem PDS-PP on $G = (V, E)$ on $n$ vertices and reduce it to an $\alpha = (2 - \epsilon)$-center stable instance. We define our distance metric as follows. Every vertex $v \in V$ becomes a point in the $k$-median instance. For any two vertices $(u, v) \in E$ we define $d(u, v) = 1/2$. When $(u, v) \notin E$, we set $d(u, v) = 1$. This trivially satisfies the triangle inequality for any graph $G$, as the sum of the distances along any two edges is at least 1. We set $k = d$.

We observe that a $k$-median solution of cost $(n - k)/2$ corresponds to a dominating set of size $d$ in the PDS-PP instance, and is therefore NP-hard to find. We also observe that because all solutions of size $\leq d$ in the PDS-PP instance are perfect, each (non-center) point in the $k$-median solution has distance 1/2 to exactly one (its own) center, and a distance of 1 to every other center. Hence, this instance is $\alpha = (2 - \epsilon)$-center stable, completing the proof. □
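The metric used in this reduction is easy to build explicitly. The sketch below constructs it for a small hypothetical graph (two disjoint paths on three vertices, whose unique smallest dominating set is perfect), verifies the triangle inequality by brute force, and evaluates the $k$-median cost of the dominating set. The helper names are our own and the example is not taken from the proof.

```python
from itertools import product

def reduction_metric(n, edges):
    """d(u, v) = 1/2 if (u, v) is an edge, 1 if it is not, and 0 when u == v."""
    edge_set = {frozenset(e) for e in edges}
    return [[0.0 if u == v else (0.5 if frozenset((u, v)) in edge_set else 1.0)
             for v in range(n)] for u in range(n)]

def satisfies_triangle_inequality(d):
    """Brute-force check of d(u, w) <= d(u, v) + d(v, w) over all triples."""
    n = len(d)
    return all(d[u][w] <= d[u][v] + d[v][w] for u, v, w in product(range(n), repeat=3))

def kmedian_cost(d, centers):
    """k-median cost: each point pays the distance to its nearest center."""
    return sum(min(d[p][c] for c in centers) for p in range(len(d)))

# Hypothetical input graph: two disjoint paths 0-1-2 and 3-4-5.
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]
n, k = 6, 2
d = reduction_metric(n, edges)
dominating_set = [1, 4]                  # perfect dominating set of size k
cost = kmedian_cost(d, dominating_set)   # should equal (n - k) / 2
```

Every non-center vertex is at distance 1/2 from its own center and 1 from the other, which is exactly the gap that yields $(2 - \epsilon)$-center stability.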

3.2 The Min-Sum Objective 

Analogously to Lemma 4, we can show that $\alpha$-perturbation resilience implies our new notion of $\alpha$-min-sum stability.

Lemma 6. If a clustering instance is $\alpha$-perturbation resilient, then it is also $\alpha$-min-sum stable.



Proof. Assume to the contrary that the instance is $\alpha$-perturbation resilient but is not $\alpha$-min-sum stable. Then, there exist clusters $C_i, C_j$ in the optimal solution $\mathcal{C}$ and a point $p \in C_i$ such that $\alpha d(p, C_i) > d(p, C_j)$. We perturb $d$ as follows. We define $d'$ such that for all points $q \in C_i$, $d'(p, q) = \alpha d(p, q)$, and for the remaining distances, $d' = d$. Clearly $d'$ is an $\alpha$-perturbation of $d$.

We now note that $\mathcal{C}$ is not optimal under $d'$. Namely, we can create a cheaper solution $\mathcal{C}'$ that assigns point $p$ to cluster $C_j$ and leaves the remaining clusters unchanged. This shows that $\mathcal{C}$ is not the optimum under $d'$, which contradicts the instance being $\alpha$-perturbation resilient. Therefore we can conclude that if a clustering instance is $\alpha$-perturbation resilient, then it must also be $\alpha$-min-sum stable. □
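The perturbation used in this proof can be traced on a small example. In the sketch below (hypothetical instance and function names; the min-sum cost is assumed to sum over ordered pairs), the point at position 2 violates 4-min-sum stability, and stretching its distances to its own cluster by $\alpha = 4$ makes it cheaper to move that point to the other cluster.

```python
def minsum_cost(d, clusters):
    """Min-sum cost, assumed to sum d(p, q) over ordered pairs within each cluster."""
    return sum(d[p][q] for cl in clusters for p in cl for q in cl)

def perturb_own_cluster(d, p, cluster, alpha):
    """d'(p, q) = alpha * d(p, q) for q in p's own cluster; all other distances unchanged."""
    d2 = [row[:] for row in d]
    for q in cluster:
        if q != p:
            d2[p][q] = d2[q][p] = alpha * d[p][q]
    return d2

# Hypothetical instance: points on a line at 0, 2, 5, 6, optimally clustered as below.
pos = [0, 2, 5, 6]
d = [[abs(a - b) for b in pos] for a in pos]
clusters = [{0, 1}, {2, 3}]

p, alpha = 1, 4                      # alpha * d(p, C_1) = 8 > 7 = d(p, C_2)
d_pert = perturb_own_cluster(d, p, clusters[0], alpha)
moved = [{0}, {1, 2, 3}]             # same clustering with p reassigned to C_2
cost_original = minsum_cost(d_pert, clusters)
cost_moved = minsum_cost(d_pert, moved)
```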



Moreover, the min-sum algorithm of Balcan and Liang [2011], which requires $\alpha$ to be bounded from below by $3 \left( \frac{\max_{C \in \mathcal{C}} |C|}{\min_{C \in \mathcal{C}} |C| - 1} \right)$, actually works on this more general condition (see Appendix C for details). This further motivates our following lower bound.

Theorem 7. For any $\epsilon > 0$, the problem of finding an optimal min-sum $k$ clustering in $(2 - \epsilon)$-min-sum stable instances is NP-hard.

Proof. Consider the following triangle partition problem. Let graph $G = (V, E)$ and $|V| = n = 3k$, and let each vertex have maximum degree of 4. The problem of whether the vertices of $G$ can be partitioned into sets $V_1, V_2, \ldots, V_k$ such that each $V_i$ contains a triangle in $G$ is NP-complete [Garey and Johnson, 1979], even if the maximum degree of any vertex in the graph is 4 [van Rooij et al., 2011].

We then take the triangle partition problem on an instance $G = (V, E)$ on $n$ vertices and reduce it to an $\alpha = (2 - \epsilon)$-min-sum stable instance. We define our metric as follows. Every vertex $v \in V$ becomes a point in the min-sum instance. For any two vertices $(u, v) \in E$ we define $d(u, v) = 1/2$. When $(u, v) \notin E$, we set $d(u, v) = 1$. This satisfies the triangle inequality for any graph, as the sum of the distances along any two edges is at least 1.

Now we show that we can cluster our min-sum instance into $k$ clusters such that the cost of the min-sum objective is exactly $n$ if and only if the original instance was a YES instance of the triangle partition problem. This follows from two facts: 1) a YES instance of the triangle partition maps to a clustering into $k = n/3$ clusters of size 3 with pairwise distances 1/2, giving a total cost of $n$, and 2) a balanced clustering with all minimum pairwise intra-cluster distances is optimal, and hence, a cost of $n$ is the best achievable.

In the clustering from our reduction, each point has a sum-of-distances to its own cluster of 1, as each point has a distance of 1/2 to both other points in its cluster. Now we examine the sum-of-distances of any point to the points in other clusters. A point has two distances of 1/2 (edges) to its own cluster, and because we assumed a degree bound of 4, it can have at most two more distances of 1/2 (edges) into any other cluster, leaving the third distance to the other cluster to be 1. This yields a total cost of at least 2 into any other cluster. Hence, it is $\alpha = (2 - \epsilon)$-min-sum stable. So, an algorithm for solving $(2 - \epsilon)$-min-sum stable instances can be used to solve the triangle partition problem, completing the proof. □
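The counting in this proof can be checked on a small graph. The sketch below uses a hypothetical input (two triangles joined by two edges at vertex 0, so that vertex 0 has degree 4), builds the metric of the reduction, and confirms that the triangle partition has min-sum cost $n$ and is min-sum stable for parameters below 2 but not at 2. Function names are our own, and the min-sum cost is assumed to sum over ordered pairs.

```python
def reduction_metric(n, edges):
    """d(u, v) = 1/2 if (u, v) is an edge, 1 if it is not, and 0 when u == v."""
    edge_set = {frozenset(e) for e in edges}
    return [[0.0 if u == v else (0.5 if frozenset((u, v)) in edge_set else 1.0)
             for v in range(n)] for u in range(n)]

def dist_to_set(d, p, A):
    """d(p, A): sum of the distances from p to every point of A."""
    return sum(d[p][q] for q in A)

def minsum_cost(d, clusters):
    """Min-sum cost, assumed to sum d(p, q) over ordered pairs within each cluster."""
    return sum(dist_to_set(d, p, cl) for cl in clusters for p in cl)

def is_minsum_stable(d, clusters, alpha):
    """True if alpha * d(p, C_i) < d(p, C_j) for every p in C_i and every j != i."""
    return all(alpha * dist_to_set(d, p, c_i) < dist_to_set(d, p, c_j)
               for i, c_i in enumerate(clusters)
               for j, c_j in enumerate(clusters) if i != j
               for p in c_i)

# Hypothetical input: triangles {0,1,2} and {3,4,5}, plus cross edges (0,3) and (0,4).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (0, 3), (0, 4)]
n = 6
d = reduction_metric(n, edges)
clusters = [{0, 1, 2}, {3, 4, 5}]
cost = minsum_cost(d, clusters)
```

Vertex 0 realizes the worst case described above: two edges into the other triangle give it a sum-of-distances of exactly 2 to that cluster, against 1 to its own.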

Finally, we note that it is tempting to restrict the degree bound to 3 in order to further improve the factor in the lower bound. Unfortunately, the triangle partition problem on graphs of maximum degree 3 is polynomial-time solvable [van Rooij et al., 2011], and we cannot improve the factor of $2 - \epsilon$ by restricting to graphs of degree 3 in this reduction.

4 Strong Consequences of Stability 

In Section 3, we showed that $k$-median clustering even $(2 - \epsilon)$-center stable instances is NP-hard. In this section we show that even for resilience to constant multiplicative perturbations of $\alpha > \frac{1}{2}(5 + \sqrt{41}) \approx 5.7$, the data obtains a property referred to as strict separation, namely that all points are closer to all other points in their own cluster than to points in any other cluster; this property is known to be helpful in clustering [Balcan et al., 2008]. Then we show that this property renders center-based clustering tasks fairly trivial even in the difficult one-pass streaming model.
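As a quick numeric check of the constant, the argument in Section 4.1 needs the factor $\frac{\alpha(\alpha - 1)}{2(\alpha + 1)} - 1$ to reach 1, which rearranges to $\alpha^2 - 5\alpha - 4 \geq 0$ and hence to the threshold $\frac{1}{2}(5 + \sqrt{41})$. The sketch below evaluates that factor and also gives a brute-force test of strict separation; the function names and the small instances are hypothetical.

```python
import math

# Positive root of alpha^2 - 5*alpha - 4 = 0, roughly 5.7.
ALPHA_STAR = (5 + math.sqrt(41)) / 2

def separation_factor(alpha):
    """The factor alpha(alpha - 1) / (2(alpha + 1)) - 1 from the strict separation argument."""
    return alpha * (alpha - 1) / (2 * (alpha + 1)) - 1

def has_strict_separation(d, clusters):
    """True if every point is closer to all points of its own cluster than to any other point."""
    for c_i in clusters:
        for p in c_i:
            farthest_inside = max((d[p][r] for r in c_i if r != p), default=0.0)
            nearest_outside = min((d[p][q] for c_j in clusters if c_j is not c_i for q in c_j),
                                  default=math.inf)
            if not farthest_inside < nearest_outside:
                return False
    return True

# Hypothetical instances on a line.
pos_good = [0, 1, 2, 10, 11, 12]
d_good = [[abs(a - b) for b in pos_good] for a in pos_good]
pos_bad = [0, 6, 10]
d_bad = [[abs(a - b) for b in pos_bad] for a in pos_bad]
```

In the second instance the point at position 6 is clustered with the point at 0 but is closer to the point at 10, so strict separation fails.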

4.1 Strict Separation 

To show the strict separation property, we will make use of the following lemma used by Balcan and Liang [2011], whose proof follows directly from the triangle inequality.

Lemma 8. For any two points $p$ and $p'$ belonging to two different clusters with centers $c_i$ and $c_j$, respectively, in the optimal clustering of an $\alpha$-center stable instance, it follows that $d(c_i, p') > \frac{\alpha(\alpha - 1)}{\alpha + 1} d(c_i, p)$.

Now we can prove the following theorem, which shows that even for relatively small multiplicative constants for $\alpha$, center stable, and therefore perturbation resilient, instances exhibit strict separation.

Theorem 9. Let $\mathcal{C} = \{C_1, \ldots, C_k\}$ be the optimal clustering of a $\frac{1}{2}(5 + \sqrt{41})$-center stable instance. Let $p, p' \in C_i$ and $q \in C_j$ with $i \neq j$, then $d(p, q) > d(p, p')$.

Proof. Let $\{c_1, \ldots, c_k\}$ be the centers of clusters $\{C_1, \ldots, C_k\}$. Define

$$p_f = \arg\max_{r \in C_i} d(p, r).$$

By Lemma 8 we have

$$d(c_i, q) > \frac{\alpha(\alpha - 1)}{\alpha + 1} d(c_i, p)$$

and also

$$d(c_i, q) > \frac{\alpha(\alpha - 1)}{\alpha + 1} d(c_i, p_f).$$

Adding the two gives us

$$\frac{\alpha(\alpha - 1)}{\alpha + 1} d(c_i, p) + \frac{\alpha(\alpha - 1)}{\alpha + 1} d(c_i, p_f) < 2 d(c_i, q),$$

and by the triangle inequality, we get

$$\frac{\alpha(\alpha - 1)}{\alpha + 1} d(p, p_f) < 2 d(c_i, q). \qquad (1)$$

We also have

$$d(c_i, q) \leq d(p, c_i) + d(p, q). \qquad (2)$$

Combining Equations 1 and 2 and by the definition of $p_f$, we have

$$\frac{\alpha(\alpha - 1)}{\alpha + 1} d(p, p_f) < 2 d(p, c_i) + 2 d(q, p) \leq 2 d(p, p_f) + 2 d(q, p).$$

From the RHS and LHS of the above, it follows that

$$d(p, q) > \left( \frac{\alpha(\alpha - 1)}{2(\alpha + 1)} - 1 \right) d(p, p_f) \geq \left( \frac{\alpha(\alpha - 1)}{2(\alpha + 1)} - 1 \right) d(p, p'). \qquad (3)$$

Equation 3 follows from the definitions of $p_f$ and $p'$. Finally, the statement of the theorem follows by setting $\alpha \geq \frac{1}{2}(5 + \sqrt{41}) \approx 5.7$. □

4.2 Clustering in the Streaming Model

Here, we turn to what is perhaps one of the hardest models in which to come up with good algorithms: the one-pass streaming model. In the natural streaming model for center-based objectives, the learner sees the data $p_1, p_2, \ldots$ in one pass, and must, using limited memory and time, implicitly cluster the data by retaining $k$ of the points to use as centers.² The clustering is then the one induced by placing each point in the cluster of the closest center produced by the algorithm. We note that an optimal algorithm in this model can be used for the general problem, as one can present the data to the algorithm in a streaming fashion and then, upon getting the centers, induce the corresponding clustering.

² More generally, one can imagine the algorithm coming up with $k$ centers on its own, possibly from Steiner points not contained in the data (or even arbitrary points if the metric space is Euclidean). As we stated in the introduction, unlike in the work of Awasthi et al. [2010b], we do not consider Steiner points herein.

Streaming models have been extensively studied in the context of clustering objectives [Ailon et al., 2009, Charikar et al., 2003, Guha et al., 2003, Muthukrishnan, 2003], where the known approximation guarantees are weaker than in the standard offline model. We, however, show that an $\alpha$-center stability assumption can make the problem of finding the optimum tractable for the $k$-median objective, in only one pass, and this result extends to other center-based objectives such as $k$-means. We view this not as an advance in the state-of-the-art in clustering, but rather as an illustration of how powerful stability assumptions can be, even for constant parameter values.

For our result, we can use Theorem 9 to immediately give us the following as a corollary.

Corollary 10. Let $\mathcal{C} = \{C_1, \ldots, C_k\}$ be the optimal clustering of a $\frac{1}{2}(5 + \sqrt{41})$-center stable instance. Any algorithm that chooses candidate centers $\{c'_1, \ldots, c'_k\}$ such that $c'_i \in C_i$ induces the optimal partition $\mathcal{C}$ when points are assigned to their closest centers.

This points to an algorithm that lets us easily and 
efficiently find the optimal clustering. 

Theorem 11. For $\frac{1}{2}(5 + \sqrt{41})$-center stable instances, we can recover the optimal clustering for the $k$-median objective, even in one pass in the streaming model.

Proof. Consider Algorithm 1. It proceeds as follows: it keeps $k$ centers, and whenever a new point comes in, it adds it as a center and removes some point that realizes the argmin distance among the current centers.

Algorithm 1 A one-pass streaming algorithm for $\frac{1}{2}(5 + \sqrt{41})$-center stable instances

  let $p_1, p_2, \ldots$ be the stream of points
  let $C$ be a set of candidate centers, initialized $C = \{p_1, \ldots, p_k\}$
  while there is more data in stream do
    receive point $p_t$
    $C = C \cup \{p_t\}$
    let $p \in \arg\min_{\{p_j, p_l\} \subseteq C} d(p_j, p_l)$
    $C = C \setminus \{p\}$
  end while
  return $C$ (thereby inducing a clustering $\mathcal{C}$)

The correctness of this algorithm follows from two ob- 
servations: 



1. By the pigeonhole principle, some pair of any k + 1 
points belong to the same cluster. 

2. Theorem 9 says two points in different clusters will not realize the argmin distance.

Hence, whenever a point is removed as a candidate center, it has a partner in the same optimal cluster that remains. Therefore, once we have seen a point from each cluster, we will hold one point from each cluster. By Corollary 10, this gives us the optimal partition. □
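Algorithm 1 translates almost line for line into code. The following Python sketch is one possible rendering, not a reference implementation: the function names, the distance callback, and the one-dimensional example stream are hypothetical, and the output is only meaningful when the input satisfies the stability assumption of Theorem 11.

```python
from itertools import combinations

def stream_centers(stream, k, dist):
    """One pass over the stream, keeping at most k candidate centers in memory."""
    centers = []
    for p in stream:
        centers.append(p)
        if len(centers) > k:
            # Among the k + 1 candidates, find a closest pair and drop one of its endpoints.
            _, j = min(combinations(range(len(centers)), 2),
                       key=lambda ij: dist(centers[ij[0]], centers[ij[1]]))
            centers.pop(j)
    return centers

def assign(points, centers, dist):
    """Label each point with the index of its nearest returned center."""
    return [min(range(len(centers)), key=lambda c: dist(p, centers[c])) for p in points]

# Hypothetical, well-separated stream with three clusters near 0, 100 and 1000.
stream = [0, 100, 1, 1000, 101, 2, 1002, 102]
dist = lambda a, b: abs(a - b)
centers = stream_centers(stream, 3, dist)
labels = assign(stream, centers, dist)
```

The sketch stores $k + 1$ points at any time and spends $O(k^2)$ distance evaluations per arriving point. The retained points need not be the optimal centers; by Corollary 10 it suffices that they hit each optimal cluster once.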

5 Additive Stability 

So far, in this paper our notions of stability were defined with respect to multiplicative perturbations. Similarly, we can imagine an instance being resilient with respect to additive perturbations. Consider the following definition.

Definition 12. Let $d : S \times S \to [0, 1]$, and let $0 < \beta < 1$. A clustering instance $(S, d)$ is additive $\beta$-perturbation resilient to a given objective $\Phi$ if for any function $d' : S \times S \to \mathbb{R}_{\geq 0}$ such that $\forall p, q \in S$, $d(p, q) \leq d'(p, q) \leq d(p, q) + \beta$, there is a unique optimal clustering $\mathcal{C}'$ for $\Phi$ under $d'$ and this clustering is equal to the optimal clustering $\mathcal{C}$ for $\Phi$ under $d$.

We note that in the definition above, we require all pairwise distances between points to be at most 1. Otherwise, resilience to additive perturbations would be a very weak notion, as the distances in most instances could be scaled so as to be resilient to arbitrary additive perturbations.

One possible hope is that our hardness results might only apply to the multiplicative case, and that we might be able to get polynomial time clustering algorithms for instances resilient to arbitrarily small additive perturbations. We show that this is unfortunately not the case - we introduce notions of additive stability, similar to Definitions 2 and 3, and for the $k$-median and min-sum objectives, we show correspondences between multiplicative and additive stability.

5.1 The k-Median Objective

Analogously to Definition 2, we can define a notion of additive $\beta$-center stability.

Definition 13. Let $d : S \times S \to [0, 1]$, and let $0 < \beta < 1$. A clustering instance $(S, d)$ is additive $\beta$-center stable to the $k$-median objective if for any optimal cluster $C_i \in \mathcal{C}$ with center $c_i$ and any other optimal cluster $C_j \in \mathcal{C}$ ($j \neq i$) with center $c_j$, any point $p \in C_i$ satisfies $d(p, c_i) + \beta < d(p, c_j)$.

We can now prove that additive perturbation resilience implies additive center stability.

Lemma 14. Any clustering instance satisfying additive $\beta$-perturbation resilience for the $k$-median objective also satisfies additive $\beta$-center stability.

Proof. The proof is similar to that of Lemmas 4 and 6 and appears in Appendix B. □

We now consider center stability, as in the multiplicative case. We first prove that additive center stability implies multiplicative center stability, and this gives us the property that any algorithm for $\left( \frac{1}{1 - \beta} \right)$-center stable instances will work for additive $\beta$-center stable instances.

Lemma 15. Any additive $\beta$-center stable clustering instance for the $k$-median objective is also (multiplicative) $\left( \frac{1}{1 - \beta} \right)$-center stable.

Proof. Let the optimal clustering be $C_1, \ldots, C_k$ with centers $c_1, \ldots, c_k$, respectively, of an additive $\beta$-center stable clustering instance. Let $p \in C_i$ and let $i \neq j$. From the stability property,

$$d(p, c_j) > d(p, c_i) + \beta \geq \beta. \qquad (4)$$

We also have that $d(p, c_i) < d(p, c_j) - \beta$, from which we can see that

$$\frac{1}{d(p, c_j) - \beta} < \frac{1}{d(p, c_i)}.$$

This gives us

$$\frac{d(p, c_j)}{d(p, c_i)} > \frac{d(p, c_j)}{d(p, c_j) - \beta} \geq \frac{1}{1 - \beta}. \qquad (5)$$

Equation 5 is derived as follows. The middle term, for $d(p, c_j) > \beta$ (which we have from Equation 4), is monotonically decreasing in $d(p, c_j)$. Using $d(p, c_j) \leq 1$ bounds it from below. Relating the LHS to the RHS of Equation 5 gives us the needed stability property. □
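The correspondence in Lemma 15 can be exercised numerically. The sketch below (hypothetical function names and a hypothetical four-point instance with distances in $[0, 1]$) checks additive $\beta$-center stability directly and then checks multiplicative stability at the parameter $\frac{1}{1 - \beta}$ that the lemma predicts.

```python
def is_additive_center_stable(d, clusters, centers, beta):
    """True if d(p, c_i) + beta < d(p, c_j) for every p in C_i and every j != i."""
    return all(d[p][centers[i]] + beta < d[p][c_j]
               for i, cluster in enumerate(clusters)
               for p in cluster
               for j, c_j in enumerate(centers) if j != i)

def is_center_stable(d, clusters, centers, alpha):
    """True if alpha * d(p, c_i) < d(p, c_j) for every p in C_i and every j != i."""
    return all(alpha * d[p][centers[i]] < d[p][c_j]
               for i, cluster in enumerate(clusters)
               for p in cluster
               for j, c_j in enumerate(centers) if j != i)

def implied_alpha(beta):
    """Multiplicative parameter 1 / (1 - beta) guaranteed by Lemma 15, for 0 < beta < 1."""
    return 1.0 / (1.0 - beta)

# Hypothetical instance with all distances in [0, 1]: centers are points 0 and 2.
d = [[0.0, 0.5, 1.0, 1.0],
     [0.5, 0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0, 0.5],
     [1.0, 1.0, 0.5, 0.0]]
clusters = [{0, 1}, {2, 3}]
centers = [0, 2]
beta = 0.4
alpha = implied_alpha(beta)
```

The guarantee is one-directional: an instance can be multiplicatively stable at $\frac{1}{1 - \beta}$ without being additive $\beta$-center stable.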

Now we prove a lower bound that shows that the task of clustering additive $(1/2 - \epsilon)$-center stable instances with respect to the $k$-median objective remains NP-hard.

Theorem 16. For any $\epsilon > 0$, the problem of finding an optimal $k$-median clustering in additive $(1/2 - \epsilon)$-center stable instances is NP-hard.

Proof. We use the exact reduction in Theorem 5, in which the metric satisfies the needed property that $d : S \times S \to [0, 1]$. We observe that the instances from the reduction are additive $(1/2 - \epsilon)$-center stable. Hence, a polynomial time algorithm for solving $k$-median on an additive $(1/2 - \epsilon)$-center stable instance can decide whether a PDS-PP instance contains a dominating set of a given size, completing the proof. □

5.2 The Min-Sum Objective 

Here we define additive min-sum stability and prove the analogous theorems for the min-sum objective.

Definition 17. Let $d : S \times S \to [0, 1]$, and let $0 < \beta < 1$. A clustering instance is additive $\beta$-min-sum stable for the min-sum objective if for every point $p$ in any optimal cluster $C_i$ and every other optimal cluster $C_j$ ($j \neq i$), it holds that $d(p, C_i) + \beta(|C_i| - 1) < d(p, C_j)$.

Lemma 18. If a clustering instance is additive $\beta$-perturbation resilient, then it is also additive $\beta$-min-sum stable.

Proof. The proof is similar to that of Lemmas 4 and 6 and appears in Appendix B. □

As we did for the $k$-median objective, we can also reduce additive stability to multiplicative stability for the min-sum objective.

Lemma 19. Let $t = \frac{\max_{C \in \mathcal{C}} |C|}{\min_{C \in \mathcal{C}} |C| - 1}$. Any additive $\beta$-min-sum stable clustering instance for the min-sum objective is also (multiplicative) $\left( \frac{1}{1 - \beta/t} \right)$-min-sum stable.

Proof. Let the optimal clustering be $C_1, \ldots, C_k$ and let $p \in C_i$. Let $i \neq j$. From the stability property, we have

\begin{equation}
d(p, C_j) > d(p, C_i) + \beta(|C_i| - 1) \geq \beta(|C_i| - 1). \tag{6}
\end{equation}

We also have

$$d(p, C_i) < d(p, C_j) - \beta(|C_i| - 1).$$

Taking reciprocals and multiplying by $d(p, C_j)$, we have

\begin{align}
\frac{d(p, C_j)}{d(p, C_i)} &> \frac{d(p, C_j)}{d(p, C_j) - \beta(|C_i| - 1)} \notag \\
&\geq \frac{|C_j|}{|C_j| - \beta(|C_i| - 1)} \tag{7} \\
&\geq \frac{\max_{C \in \mathcal{C}} |C|}{\max_{C \in \mathcal{C}} |C| - \beta(\min_{C \in \mathcal{C}} |C| - 1)} \notag \\
&= \frac{1}{1 - \beta/t}. \tag{8}
\end{align}

Equation 7 is derived as follows: the previous term, for $d(p, C_j) > \beta(|C_i| - 1)$ (which we have from Equation 6), is monotonically decreasing in $d(p, C_j)$. Using $d(p, C_j) \leq |C_j|$ bounds it from below. Equation 8 gives us the needed property. □
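To get a feel for the multiplicative parameter that Lemma 19 yields, the factor $1/(1 - \beta/t)$ can be evaluated for concrete cluster sizes. This is a small illustrative sketch; the function name and the guard conditions (smallest cluster of size at least 2, $\beta$ strictly between 0 and 1) are assumptions made here so that $t$ is well defined.

```python
def multiplicative_factor(cluster_sizes, beta):
    """Return 1 / (1 - beta / t) with t = max|C| / (min|C| - 1), as in Lemma 19."""
    smallest, largest = min(cluster_sizes), max(cluster_sizes)
    if smallest < 2:
        raise ValueError("t is undefined when the smallest cluster has a single point")
    if not 0 < beta < 1:
        raise ValueError("beta is assumed to lie strictly between 0 and 1")
    t = largest / (smallest - 1)
    return 1.0 / (1.0 - beta / t)


# Balanced clusters: t = 4/3, so the factor is about 1.6.
print(multiplicative_factor([4, 4], 0.5))
# Unbalanced clusters: t = 10, so the factor drops to about 1.05.
print(multiplicative_factor([10, 2], 0.5))
```

The two calls suggest how the guarantee weakens as cluster sizes become unbalanced: a larger $t$ pushes the multiplicative parameter toward 1.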
Finally, as with the k-median objective, we show that additive min-sum stability exhibits lower bounds similar to those in the multiplicative case.

Theorem 20. For any $\epsilon > 0$, the problem of finding an optimal min-sum clustering in additive $(1/2 - \epsilon)$-min-sum stable instances is NP-hard.

Proof. We use the exact reduction in Theorem 7, in which the metric satisfies the property that $d : S \times S \to [0, 1]$. We observe that the instances from the reduction are additive $(1/2 - \epsilon)$-min-sum stable. Hence, an algorithm for clustering a $(1/2 - \epsilon)$-min-sum stable instance can decide whether a triangle partition instance is a YES or NO instance. □

6 Discussion 

The lower bounds in this paper, together with the structural properties implied by fairly small constants, illustrate the important role that parameter settings play in data stability assumptions. Moreover, our results raise the question of the degree to which the assumptions studied herein hold in practice; a study of real datasets is warranted.

Another interesting direction is to relax the assumptions. Awasthi et al. [2010b] suggest considering stability under random, and not worst-case, perturbations. Balcan and Liang [2011] also study a relaxed version of the assumption, where the optimal clustering can change after perturbation, but not by too much. It is open to what extent, and on what data, all these directions will yield practical improvements.

Acknowledgements: We thank Maria-Florina Balcan for suggesting exploring additive stability. We also thank Avrim Blum and Santosh Vempala for feedback on pre-

APPENDIX

A Dominating Set Promise Problem 

We will first need some definitions. A dominating set in an unweighted graph $G = (V, E)$ is a subset $D \subseteq V$ of vertices such that each vertex in $V \setminus D$ has a neighbor in $D$. A dominating set is perfect if each vertex in $V \setminus D$ has exactly one neighbor in $D$. The problems of finding the smallest dominating set and smallest perfect dominating set are NP-hard.

Here we introduce a related problem, called the perfect dominating set promise problem. In this problem we are promised that the input graph is such that all its dominating sets of size at most $d$ are perfect, and we are asked to find a dominating set of cardinality at most $d$. We prove that this version is NP-hard as well.



Theorem 21. The perfect dominating set promise problem (PDS-PP) is NP-hard.



Proof. The 3d-matching problem (3DM) is as follows: let $X, Y, Z$ be finite disjoint sets with $m = |X| = |Y| = |Z|$. Let $T$ contain triples $(x, y, z)$ with $x \in X$, $y \in Y$, $z \in Z$, with $L = |T|$. $M \subseteq T$ is a perfect 3d-matching if for any two triples $(x_1, y_1, z_1), (x_2, y_2, z_2) \in M$, we have $x_1 \neq x_2$, $y_1 \neq y_2$, $z_1 \neq z_2$. We notice that $M$ is a disjoint partition. Determining whether a perfect 3d-matching exists (YES vs. NO instance) in a 3d-matching instance is known to be NP-complete.

Now we reduce an instance of the 3DM problem to PDS-PP on $G = (V, E)$. For 3DM elements $X$, $Y$, and $Z$ we construct vertices $V_X$, $V_Y$, and $V_Z$, respectively. For each triple in $T$ we construct a vertex in set $V_T$. Additionally, we make an extra vertex $v$. This gives $V = V_X \cup V_Y \cup V_Z \cup V_T \cup \{v\}$. We make the edge set $E$ as follows. Every vertex in $V_T$ (which corresponds to a triple) has an edge to the vertices that it contains in the corresponding 3DM instance (one in each of $V_X$, $V_Y$, and $V_Z$). Every vertex in $V_T$ also has an edge to $v$.

Now we will examine the structure of the smallest dominating set $D$ in the constructed PDS-PP instance. The vertex $v$ must belong to $D$ so that all vertices in $V_T$ are covered. Then, what remains is to optimally cover the vertices in $V_X \cup V_Y \cup V_Z$; the cheapest solution is to use $m$ vertices from $V_T$, and this is precisely the 3DM problem, which is NP-hard. Hence, any solution of size $d = m + 1$ for the PDS-PP instance gives a solution to the 3DM instance.

We also observe that such a solution makes a perfect dominating set. Each vertex in $V_T \setminus D$ has one neighbor in $D$, namely $v$. Each vertex in $V_X \cup V_Y \cup V_Z$ has a unique neighbor in $D$, namely the vertex in $V_T$ corresponding to its respective set in the 3DM instance. □
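The reduction in the proof of Theorem 21 can be sketched on a toy 3DM instance. The code below is an illustration only: the tagged-tuple vertex names, the function names, and the example instance are hypothetical choices made here, and the checks only test a given set $D$ rather than verifying the promise over all dominating sets.

```python
def build_pds_graph(X, Y, Z, triples):
    """Adjacency sets of the reduction graph: V_X, V_Y, V_Z, V_T and the extra vertex v."""
    adj = {("v",): set()}
    for tag, elements in (("x", X), ("y", Y), ("z", Z)):
        for e in elements:
            adj[(tag, e)] = set()
    for triple in triples:
        t_vertex = ("t", triple)
        adj[t_vertex] = set()
        x, y, z = triple
        # Each triple vertex is joined to its three elements and to v.
        for u in (("x", x), ("y", y), ("z", z), ("v",)):
            adj[t_vertex].add(u)
            adj[u].add(t_vertex)
    return adj


def is_dominating(adj, D):
    """Every vertex outside D has at least one neighbor in D."""
    return all(u in D or adj[u] & D for u in adj)


def is_perfect_dominating(adj, D):
    """Every vertex outside D has exactly one neighbor in D."""
    return all(u in D or len(adj[u] & D) == 1 for u in adj)


X = Y = Z = [1, 2]                      # m = 2
triples = [(1, 1, 1), (2, 2, 2), (1, 2, 2)]
adj = build_pds_graph(X, Y, Z, triples)

matching = [(1, 1, 1), (2, 2, 2)]       # a perfect 3d-matching
D = {("v",)} | {("t", t) for t in matching}
print(len(D), is_dominating(adj, D), is_perfect_dominating(adj, D))  # 3 True True

not_matching = {("v",), ("t", (1, 1, 1)), ("t", (1, 2, 2))}
print(is_dominating(adj, not_matching))  # False: element 2 of X is left uncovered
```

In this example the set built from a perfect matching has size $m + 1 = 3$ and is perfect, while a set of the same size built from overlapping triples fails to dominate.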

B Results for Additive Stability 

Lemma 14. Any clustering instance satisfying additive $\beta$-perturbation resilience for the k-median objective also satisfies additive $\beta$-center stability.

Proof. We prove that for every point $p$ and its center $c_i$ in the optimal clustering of an additive $\beta$-perturbation resilient instance, it holds that $d(p, c_j) > d(p, c_i) + \beta$ for any $j \neq i$.

Consider an additive $\beta$-perturbation resilient clustering instance. Assume we blow up all the pairwise distances within cluster $C_i$ by an additive term of $\beta$. As this is a legitimate perturbation of the distance function, the optimal clustering under this perturbation is the same as the original one. Hence, $p$ is still assigned to the same cluster. Furthermore, since the distances within $C_i$ were all changed by the same constant, $c_i$ will remain the center of the cluster. The same holds for any other optimal clusters. Since the optimal clustering under the perturbed distances is unique, it follows that even in the perturbed distance function, $p$ prefers $c_i$ to $c_j$, which implies the lemma. □
Lemma 18. If a clustering instance is additive $\beta$-perturbation resilient, then it is also additive $\beta$-min-sum stable.



Proof. Assume to the contrary that the instance is additive $\beta$-perturbation resilient but is not additive $\beta$-min-sum stable. Then, there exist clusters $C_i, C_j$ in the optimal solution $\mathcal{C}$ and a point $p \in C_i$ such that $d(p, C_i) + \beta(|C_i| - 1) \geq d(p, C_j)$. Then, we perturb $d$ as follows. We define $d'$ such that for all points $q \in C_i$ with $q \neq p$, $d'(p, q) = d(p, q) + \beta$, and for the remaining distances $d' = d$. Clearly $d'$ is a valid additive $\beta$-perturbation of $d$.

We now note that $\mathcal{C}$ is not the unique optimum under $d'$. Namely, we can create a solution $\mathcal{C}'$ that costs no more than $\mathcal{C}$ by assigning point $p$ to cluster $C_j$ and leaving the remaining clusters unchanged. This contradicts the fact that the instance is additive $\beta$-perturbation resilient. Therefore, we conclude that if a clustering instance is additive $\beta$-perturbation resilient, then it must also be additive $\beta$-min-sum stable. □
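The perturbation used in the proof of Lemma 18 can be traced on a small example. The sketch below assumes the min-sum cost is the sum of $d(p, q)$ over ordered pairs inside each cluster; the 4-point metric, the value $\beta = 0.5$, and the function names are hypothetical. Point 0 violates the stability condition here ($0.125 + 0.5$ is not below $0.5$), so after the perturbation, moving it to the other cluster does not increase the cost.

```python
def min_sum_cost(d, clusters):
    """Min-sum objective: sum of d(p, q) over ordered pairs within each cluster."""
    return sum(d[p][q] for c in clusters for p in c for q in c)


def perturb_from_point(d, p, cluster, beta):
    """Return d' with d'(p, q) = d(p, q) + beta for q in cluster, q != p (kept symmetric)."""
    d_prime = [row[:] for row in d]
    for q in cluster:
        if q != p:
            d_prime[p][q] += beta
            d_prime[q][p] += beta
    return d_prime


# Hypothetical metric: within-cluster distance 0.125, cross-cluster distance 0.25.
d = [[0.0, 0.125, 0.25, 0.25],
     [0.125, 0.0, 0.25, 0.25],
     [0.25, 0.25, 0.0, 0.125],
     [0.25, 0.25, 0.125, 0.0]]
original = [[0, 1], [2, 3]]
moved = [[1], [0, 2, 3]]                 # point 0 reassigned to the other cluster

d_prime = perturb_from_point(d, 0, original[0], 0.5)
print(min_sum_cost(d, original), min_sum_cost(d, moved))              # 0.5 1.25
print(min_sum_cost(d_prime, original), min_sum_cost(d_prime, moved))  # 1.5 1.25
```

Under the original distances the clustering `original` is cheaper, but under the perturbed distances the reassigned clustering wins, which is the kind of change that perturbation resilience rules out.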



C Average Linkage for Min-Sum Stability

In this appendix, we further support the claim that algorithms designed for $\alpha$-perturbation resilient instances with respect to the min-sum objective can often be made to work for data satisfying the more general (and therefore, weaker) $\alpha$-min-sum stability property.

One such algorithm is the Average Linkage algorithm of Balcan and Liang (2011). The algorithm requires the condition in Lemma 22 to hold, which we can prove indeed holds for $\alpha$-min-sum stable instances (their proof of the lemma holds for the more restricted class of perturbation-resilient instances). To state the lemma, we first define the distance between two point sets, $A$ and $B$:

$$d(A, B) = \sum_{p \in A} \sum_{q \in B} d(p, q).$$

Lemma 22. Assume the optimal clustering is $\alpha$-min-sum stable. For any two different clusters $C$ and $C'$ in $\mathcal{C}$ and every $A \subseteq C$, it holds that $\alpha d(A, A) < d(A, C')$.

Proof. From the definition of $\alpha d(A, A)$, we have

\begin{align*}
\alpha d(A, A) &= \sum_{p \in A} \alpha \sum_{q \in A} d(p, q) \\
&\leq \sum_{p \in A} \alpha \sum_{q \in C} d(p, q) \\
&< \sum_{p \in A} \sum_{q \in C'} d(p, q) \\
&= d(A, C').
\end{align*}

The first inequality comes from $A \subseteq C$ and the second by definition of min-sum stability. □

This, in addition to Lemma 6, can be used to show their algorithm can be employed for min-sum stable instances.
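The inequality of Lemma 22 can be verified by brute force on small instances. The following sketch assumes the set distance is the double sum of $d(p, q)$ over $p \in A$ and $q \in B$, and enumerates every nonempty subset $A$ of each cluster, so it is only practical for tiny examples; the function names and the example metric are hypothetical.

```python
from itertools import combinations


def set_distance(d, A, B):
    """d(A, B): sum of d(p, q) over all p in A and q in B."""
    return sum(d[p][q] for p in A for q in B)


def lemma_22_holds(d, clusters, alpha):
    """Check alpha * d(A, A) < d(A, C') for every nonempty A inside C and every C' != C."""
    for i, c in enumerate(clusters):
        for size in range(1, len(c) + 1):
            for A in combinations(c, size):
                for j, other in enumerate(clusters):
                    if j != i and not alpha * set_distance(d, A, A) < set_distance(d, A, other):
                        return False
    return True


# Hypothetical metric: within-cluster distance 0.125, cross-cluster distance 0.875.
d = [[0.0, 0.125, 0.875, 0.875],
     [0.125, 0.0, 0.875, 0.875],
     [0.875, 0.875, 0.0, 0.125],
     [0.875, 0.875, 0.125, 0.0]]
clusters = [[0, 1], [2, 3]]

# alpha = 2: each point satisfies 2 * 0.125 < 1.75, and the subset inequality holds.
print(lemma_22_holds(d, clusters, 2))   # True
# alpha = 16: the instance is no longer stable, and the inequality fails for A = C.
print(lemma_22_holds(d, clusters, 16))  # False
```

The exponential enumeration is a deliberate simplification for illustration; the lemma itself is what guarantees the inequality without any such search.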



